In [39]:
import pandas as pd
import numpy as np
from sklearn import model_selection
from sklearn import metrics
import seaborn as sns
%matplotlib inline
import matplotlib.pyplot as plt

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
In [245]:
df=pd.read_excel("CaseStudy_Cancer.xls")
In [130]:
df.head(2)
Out[130]:
ID B-M radius texture perimeter area smoothness compactness concavity concave points ... radius-W texture-W perimeter-W area-W smoothness-W compactness-W concavity-W concave points-W Symmetry-W fractal dimension-W
0 842302 M 17.99 10.38 122.8 1001.0 0.11840 0.27760 0.3001 0.14710 ... 25.38 17.33 184.6 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890
1 842517 M 20.57 17.77 132.9 1326.0 0.08474 0.07864 0.0869 0.07017 ... 24.99 23.41 158.8 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902

2 rows × 32 columns

In [42]:
df.shape
Out[42]:
(569, 32)
In [43]:
sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')
Out[43]:
<matplotlib.axes._subplots.AxesSubplot at 0x14e15550>
In [69]:
df[df.isnull().any(axis=1)]
Out[69]:
ID B-M radius texture perimeter area smoothness compactness concavity concave points ... radius-W texture-W perimeter-W area-W smoothness-W compactness-W concavity-W concave points-W Symmetry-W fractal dimension-W

0 rows × 32 columns

In [246]:
df.drop('ID', axis=1, inplace=True)
# Drop the ID column: it is just a record identifier with no predictive value
In [71]:
df.dtypes
Out[71]:
B-M                      object
radius                  float64
texture                 float64
perimeter               float64
area                    float64
smoothness              float64
compactness             float64
concavity               float64
concave points          float64
Symmetry                float64
fractal dimension       float64
SE-radius               float64
texture-SE              float64
perimeter-SE            float64
area-SE                 float64
smoothness-SE           float64
compactness-SE          float64
concavity-SE            float64
concave points-SE       float64
Symmetry-SE             float64
fractal dimension-SE    float64
radius-W                float64
texture-W               float64
perimeter-W             float64
area-W                  float64
smoothness-W            float64
compactness-W           float64
concavity-W             float64
concave points-W        float64
Symmetry-W              float64
fractal dimension-W     float64
dtype: object
In [254]:
#cleanup_nums = {"B-M":     {"M": 1, "B": 0}}
#df.replace(cleanup_nums, inplace=True)
In [255]:
#df.dtypes
In [21]:
#Heat map of the correlation matrix, to quantify the relationships between the variables
#calculate the correlation matrix
corr = df.corr()
cmap = sns.diverging_palette(5, 250, as_cmap=True)


#draw the correlation table 
def magnify():
    return [dict(selector="th",
                 props=[("font-size", "7pt")]),
            dict(selector="td",
                 props=[('padding', "0em 0em")]),
            dict(selector="th:hover",
                 props=[("font-size", "12pt")]),
            dict(selector="tr:hover td:hover",
                 props=[('max-width', '200px'),
                        ('font-size', '12pt')])
]

corr.style.background_gradient(cmap, axis=1)\
    .set_properties(**{'max-width': '80px', 'font-size': '10pt'})\
    .set_caption("Hover to magnify")\
    .set_precision(2)\
    .set_table_styles(magnify())
Out[21]:
Hover to magnify
(31×31 correlation table of the target B-M and all 30 features, styled for hover magnification; the full matrix is omitted here. Highlights: B-M correlates most strongly with concave points-W (0.79), radius-W and perimeter-W (0.78), concave points (0.78), perimeter (0.74), and radius (0.73); radius, perimeter, and area are almost perfectly intercorrelated (0.99–1.00), as are their worst-value counterparts; mean fractal dimension is essentially uncorrelated with the target (-0.013).)
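Beyond eyeballing the styled table, the strongly correlated pairs can be pulled out programmatically. A minimal sketch on a small synthetic frame (the column names mimic the cancer data but the values are made up):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in frame; 'perimeter' is built as a near-duplicate of 'radius'
rng = np.random.default_rng(0)
x = rng.normal(size=100)
demo = pd.DataFrame({
    "radius": x,
    "perimeter": 6.28 * x + rng.normal(scale=0.01, size=100),
    "texture": rng.normal(size=100),
})

corr = demo.corr().abs()
# Keep only the upper triangle so each pair is listed once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high_pairs = upper.stack()
high_pairs = high_pairs[high_pairs > 0.9]
print(high_pairs)  # radius/perimeter should be the only pair above 0.9
```

Run on the real frame, this flags the radius/perimeter/area trio and their worst-value counterparts seen in the table above.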
In [75]:
df['B-M'].value_counts().plot(kind='bar',color='purple')
plt.title("Diagnosis Details")
plt.ylabel('Diagnosis counts')
plt.xlabel('Diagnosis type');
In [93]:
#Univariate histograms for the first nine attributes, grouped by the class variable
fig = plt.figure()
fig.set_figheight(5)
fig.set_figwidth(15)
num_bins = 10

features = ['radius', 'texture', 'perimeter', 'area', 'smoothness',
            'compactness', 'concavity', 'concave points', 'Symmetry']
for i, col in enumerate(features, start=1):
    ax = fig.add_subplot(3, 3, i)
    # density=False replaces the normed=0 argument removed from matplotlib
    ax.hist(df.loc[df['B-M'] == 'B', col], num_bins, density=False,
            facecolor='blue', alpha=0.5, label='B')
    ax.hist(df.loc[df['B-M'] == 'M', col], num_bins, density=False,
            facecolor='red', alpha=0.5, label='M')
    ax.legend(loc='upper right')
    ax.set_title(col)

plt.tight_layout()
plt.show()
In [99]:
#Univariate histograms for the next nine attributes, grouped by the class variable
fig = plt.figure()
fig.set_figheight(5)
fig.set_figwidth(15)
num_bins = 10

features = ['fractal dimension', 'SE-radius', 'texture-SE', 'perimeter-SE',
            'area-SE', 'smoothness-SE', 'compactness-SE', 'concavity-SE',
            'concave points-SE']
for i, col in enumerate(features, start=1):
    ax = fig.add_subplot(3, 3, i)
    # density=False replaces the normed=0 argument removed from matplotlib
    ax.hist(df.loc[df['B-M'] == 'B', col], num_bins, density=False,
            facecolor='blue', alpha=0.5, label='B')
    ax.hist(df.loc[df['B-M'] == 'M', col], num_bins, density=False,
            facecolor='red', alpha=0.5, label='M')
    ax.legend(loc='upper right')
    ax.set_title(col)

plt.tight_layout()
plt.show()
In [101]:
#Univariate histograms for the next nine attributes, grouped by the class variable
fig = plt.figure()
fig.set_figheight(5)
fig.set_figwidth(15)
num_bins = 10

features = ['Symmetry-SE', 'fractal dimension-SE', 'perimeter-W', 'radius-W',
            'texture-W', 'area-W', 'smoothness-W', 'compactness-W',
            'concavity-W']
for i, col in enumerate(features, start=1):
    ax = fig.add_subplot(3, 3, i)
    # density=False replaces the normed=0 argument removed from matplotlib
    ax.hist(df.loc[df['B-M'] == 'B', col], num_bins, density=False,
            facecolor='blue', alpha=0.5, label='B')
    ax.hist(df.loc[df['B-M'] == 'M', col], num_bins, density=False,
            facecolor='red', alpha=0.5, label='M')
    ax.legend(loc='upper right')
    ax.set_title(col)

plt.tight_layout()
plt.show()
In [102]:
#Univariate histograms for the remaining three attributes, grouped by the class variable
fig = plt.figure()
fig.set_figheight(5)
fig.set_figwidth(15)
num_bins = 10

features = ['concave points-W', 'Symmetry-W', 'fractal dimension-W']
for i, col in enumerate(features, start=1):
    ax = fig.add_subplot(3, 3, i)
    # density=False replaces the normed=0 argument removed from matplotlib
    ax.hist(df.loc[df['B-M'] == 'B', col], num_bins, density=False,
            facecolor='blue', alpha=0.5, label='B')
    ax.hist(df.loc[df['B-M'] == 'M', col], num_bins, density=False,
            facecolor='red', alpha=0.5, label='M')
    ax.legend(loc='upper right')
    ax.set_title(col)

plt.tight_layout()
plt.show()

From these histograms we can see that a feature such as mean fractal dimension plays very little role in discerning malignant from benign, whereas worst concave points and worst perimeter are useful features that give a strong hint about the classes in this cancer data set. Per-class histograms are worth making early, because they show how separable the classes are: if even a single feature, e.g. worst perimeter, splits the two distributions cleanly, that feature alone can be nearly good enough to separate malignant from benign cases.

In [103]:
sns.pairplot(df,hue='B-M',palette='Set1')
Out[103]:
<seaborn.axisgrid.PairGrid at 0x1749a5f8>
In [250]:
plt.scatter(df['Symmetry'], df['texture'],s=df['area']*0.05, color='magenta', label='check', alpha=0.3)
plt.xlabel('Symmetry',fontsize=12)
plt.ylabel('Texture',fontsize=12)
plt.tight_layout()
plt.show()
In [252]:
plt.scatter(df['radius'], df['concave points'], s=df['area']*0.05, color='purple', label='check', alpha=0.3)
plt.xlabel('Radius',fontsize=12)
plt.ylabel('Concave Points',fontsize=12)
plt.tight_layout()
plt.show()
In [133]:
cleanup_nums = {"B-M":     {"M": 1, "B": 0}}
df.replace(cleanup_nums, inplace=True)
In [134]:
# Split the data into training and test sets in a 70:30 ratio
# (head/tail gives a sequential, not random, split; a random split is tried later)
n = df['B-M'].count()
train_set = df.head(int(round(n*0.7))).copy()
test_set = df.tail(int(round(n*0.3))).copy()

# capture the target column ("B-M") into separate vectors for the training and test sets
train_labels = train_set.pop("B-M")
test_labels = test_set.pop("B-M")
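The head/tail split above is sequential, so any ordering in the file leaks into the split. A hedged sketch of the alternative used later in this notebook, `train_test_split`, here with stratification added and synthetic stand-in data:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows with a binary target like B-M
rng = np.random.default_rng(7)
demo = pd.DataFrame({"f1": rng.normal(size=100),
                     "B-M": rng.integers(0, 2, size=100)})

X = demo.drop(columns="B-M")
y = demo["B-M"]
# stratify=y keeps the benign/malignant ratio the same in both halves
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=7)
print(len(X_tr), len(X_te))  # 70 and 30
```

Stratification matters here because the classes are imbalanced (357 benign vs 212 malignant in this data set).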
In [135]:
# Create a decision tree that uses entropy (information gain) as the split criterion, then fit it to the training data
from sklearn.tree import DecisionTreeClassifier
dt_model = DecisionTreeClassifier(criterion = 'entropy')
In [136]:
#Fit the model
dt_model.fit(train_set, train_labels)
Out[136]:
DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
In [137]:
# Test the model on test data
dt_model.score(test_set , test_labels)
Out[137]:
0.935672514619883
In [139]:
test_pred = dt_model.predict(test_set)
y_grid = np.column_stack([test_pred, test_labels])
In [140]:
#Generate Cross tab
result = pd.DataFrame(y_grid)
result.columns= ["Predicted","Actual"]
pd.crosstab(result.Predicted,result.Actual, margins = True)
Out[140]:
Actual 0 1 All
Predicted
0 122 1 123
1 10 38 48
All 132 39 171
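The headline metrics can be read straight off this cross tab. A small sketch that recomputes them from the counts above (rows are predicted, columns actual; 0 = benign, 1 = malignant):

```python
import numpy as np

# Confusion counts copied from the cross tab above
cm = np.array([[122, 1],
               [10, 38]])

tn, fn = cm[0]                 # predicted benign
fp, tp = cm[1]                 # predicted malignant
total = cm.sum()

accuracy = (tn + tp) / total
sensitivity = tp / (tp + fn)   # recall on the malignant class
specificity = tn / (tn + fp)
print(round(accuracy, 4), round(sensitivity, 4), round(specificity, 4))
# -> 0.9357 0.9744 0.9242
```

The 0.9357 accuracy matches the `score` output above; sensitivity and specificity are worth tracking separately in a cancer setting, since a false negative is costlier than a false positive.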
In [141]:
from IPython.display import Image  
from sklearn import tree
from os import system
train_char_label = ['No', 'Yes']
Cancer_Tree_File = open('Cancer_tree.dot','w')
dot_data = tree.export_graphviz(dt_model, out_file=Cancer_Tree_File, feature_names = list(train_set), class_names = list(train_char_label))

Cancer_Tree_File.close()

print (pd.DataFrame(dt_model.feature_importances_, columns = ["Imp"], index = train_set.columns))
                           Imp
radius                0.000000
texture               0.042109
perimeter             0.000000
area                  0.000000
smoothness            0.000000
compactness           0.000000
concavity             0.000000
concave points        0.000000
Symmetry              0.000000
fractal dimension     0.000000
SE-radius             0.021081
texture-SE            0.000000
perimeter-SE          0.000000
area-SE               0.000000
smoothness-SE         0.013958
compactness-SE        0.009267
concavity-SE          0.000000
concave points-SE     0.000000
Symmetry-SE           0.000000
fractal dimension-SE  0.012299
radius-W              0.000000
texture-W             0.046149
perimeter-W           0.691598
area-W                0.000000
smoothness-W          0.060368
compactness-W         0.000000
concavity-W           0.000000
concave points-W      0.092995
Symmetry-W            0.010176
fractal dimension-W   0.000000
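The same importance listing can be produced and sorted on sklearn's bundled copy of the Wisconsin data (a stand-in for the Excel file, so the exact numbers and column names differ):

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

# Bundled copy of the Wisconsin breast-cancer data, as a frame
X, y = load_breast_cancer(return_X_y=True, as_frame=True)

tree = DecisionTreeClassifier(criterion="entropy", random_state=7).fit(X, y)
imp = pd.Series(tree.feature_importances_, index=X.columns)
print(imp.sort_values(ascending=False).head(5))
```

As in the table above, a single tree concentrates almost all of its importance on a handful of features; the importances always sum to 1.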

#Would you get the same result if you recreate the training and test data using random function?

In [142]:
from sklearn.model_selection import train_test_split
X = df.drop(['B-M'], axis=1)
Y = df[['B-M']]
test_size = 0.30
seed = 7  
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)
In [143]:
# Recreate the model using the regularization parameters max_depth and min_samples_leaf.
# What is the impact on model accuracy, and how does regularization help?
# The same exercise is repeated with Random Forest below.
cancer_model = DecisionTreeClassifier(criterion = 'entropy', class_weight={0:.80,1:.20}, max_depth = 9, min_samples_leaf=5 )
In [144]:
cancer_model.fit(X_train,y_train)
Out[144]:
DecisionTreeClassifier(class_weight={0: 0.8, 1: 0.2}, criterion='entropy',
            max_depth=9, max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=5, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
In [145]:
cancer_model.score(X_test, y_test)
Out[145]:
0.9181286549707602
In [146]:
predictions = cancer_model.predict(X_test)
from sklearn.metrics import classification_report,confusion_matrix
print(classification_report(y_test,predictions))
             precision    recall  f1-score   support

          0       0.91      0.97      0.94       116
          1       0.94      0.80      0.86        55

avg / total       0.92      0.92      0.92       171

In [147]:
print(confusion_matrix(y_test,predictions))
[[113   3]
 [ 11  44]]
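To see how max_depth trades accuracy against overfitting, a quick sketch sweeping a few depths with 5-fold cross-validation on sklearn's bundled breast-cancer data (a stand-in for this notebook's frame, so the scores will differ slightly):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Sweep a few depths; None means grow the tree until the leaves are pure
cv_scores = {}
for depth in (None, 3, 5, 9):
    tree = DecisionTreeClassifier(criterion="entropy", max_depth=depth,
                                  min_samples_leaf=5, random_state=7)
    cv_scores[depth] = cross_val_score(tree, X, y, cv=5).mean()
    print(depth, round(cv_scores[depth], 3))
```

Cross-validated scores give a more stable comparison between depths than a single train/test split.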

The results are not the same when the training and test sets are recreated with a random split: the accuracy score drops from 0.936 to 0.918.

-----------------iteration2--------------

In [148]:
from sklearn import preprocessing

# Note: preprocessing.scale() standardizes each set independently; the safer
# pattern is to fit a StandardScaler on the training set only and apply that
# same fitted transform to the test set, so no test-set statistics leak in.
X_train_scaled = preprocessing.scale(X_train)
X_test_scaled = preprocessing.scale(X_test)
In [149]:
dt_model2 = DecisionTreeClassifier(criterion = 'entropy', class_weight={0:.5,1:.5}, max_depth = 5, min_samples_leaf=5 )
dt_model2.fit(train_set, train_labels)
Out[149]:
DecisionTreeClassifier(class_weight={0: 0.5, 1: 0.5}, criterion='entropy',
            max_depth=5, max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=5, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
In [150]:
Cancer_tree_regularized = open('Cancer_tree_regularized.dot','w')
dot_data = tree.export_graphviz(dt_model2, out_file= Cancer_tree_regularized , feature_names = list(train_set), class_names = list(train_char_label))

Cancer_tree_regularized.close()

print (pd.DataFrame(dt_model2.feature_importances_, columns = ["Imp"], index = train_set.columns))
                           Imp
radius                0.004991
texture               0.044004
perimeter             0.000000
area                  0.000000
smoothness            0.000000
compactness           0.000000
concavity             0.000000
concave points        0.003256
Symmetry              0.000000
fractal dimension     0.000000
SE-radius             0.000000
texture-SE            0.000000
perimeter-SE          0.019796
area-SE               0.000000
smoothness-SE         0.000000
compactness-SE        0.000000
concavity-SE          0.000000
concave points-SE     0.000000
Symmetry-SE           0.000000
fractal dimension-SE  0.000000
radius-W              0.000000
texture-W             0.048188
perimeter-W           0.722723
area-W                0.000000
smoothness-W          0.063085
compactness-W         0.000000
concavity-W           0.000000
concave points-W      0.093958
Symmetry-W            0.000000
fractal dimension-W   0.000000
In [151]:
test_pred = dt_model2.predict(test_set)
dt_model2.score(test_set , test_labels)
#There is not much improvement
Out[151]:
0.935672514619883

----------------random forest---------------

In [155]:
from sklearn.ensemble import RandomForestClassifier
rfcl = RandomForestClassifier(criterion = 'entropy', class_weight={0:.5,1:.5}, max_depth = 5, min_samples_leaf=5)
rfcl = rfcl.fit(train_set, train_labels)
In [156]:
test_pred = rfcl.predict(test_set)
rfcl.score(test_set , test_labels)
Out[156]:
0.9649122807017544
In [157]:
# 10. What is the optimal number of trees that gives the best result?
model = RandomForestClassifier(n_jobs=-1, criterion = 'entropy')

estimators = np.arange(9, 201, 12)
scores = []
for n in estimators:
    model.set_params(n_estimators=n)
    model.fit(train_set, train_labels)
    scores.append(model.score(test_set, test_labels))
plt.title("Effect of n_estimators")
plt.xlabel("n_estimator")
plt.ylabel("score")
plt.plot(estimators, scores)
plt.show()
In [158]:
tree_array = [estimators,scores]
trees = pd.DataFrame(tree_array).transpose()
trees.columns = ["Trees","Scores"]
trees
Out[158]:
Trees Scores
0 9.0 0.964912
1 21.0 0.959064
2 33.0 0.964912
3 45.0 0.964912
4 57.0 0.964912
5 69.0 0.964912
6 81.0 0.970760
7 93.0 0.947368
8 105.0 0.970760
9 117.0 0.970760
10 129.0 0.976608
11 141.0 0.988304
12 153.0 0.970760
13 165.0 0.982456
14 177.0 0.970760
15 189.0 0.964912
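Scoring each n_estimators value on the same test set, as above, risks tuning to that set's noise (note how the scores bounce around). One alternative sketch: use the forest's out-of-bag estimate, again on sklearn's bundled breast-cancer data as a stand-in:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# oob_score=True reuses each tree's out-of-bag rows as a built-in
# validation set, so the held-out test data is never consulted
oob = {}
for n in (50, 100, 200):
    rf = RandomForestClassifier(n_estimators=n, oob_score=True,
                                random_state=7, n_jobs=-1)
    rf.fit(X, y)
    oob[n] = rf.oob_score_
print(oob)
```

The OOB curve typically flattens once there are enough trees; beyond that point, extra trees add compute but little accuracy, which matches the plateau in the table above.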
In [159]:
rf_model = RandomForestClassifier(n_estimators = 350, criterion = 'entropy')
rf_model = rf_model.fit(train_set, train_labels)
test_pred = rf_model.predict(test_set)
rf_model.score(test_set , test_labels)
Out[159]:
0.9649122807017544

----------------Ensemble--------------

In [160]:
from sklearn.ensemble import RandomForestClassifier
In [161]:
rfc = RandomForestClassifier(n_estimators=100)
In [162]:
rfc.fit(X_train,y_train)
C:\Users\bojha\AppData\Local\Continuum\Anaconda3\lib\site-packages\ipykernel\__main__.py:1: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
  if __name__ == '__main__':
Out[162]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
In [163]:
predictions = rfc.predict(X_test)
In [164]:
from sklearn import metrics
from sklearn import preprocessing
from sklearn.metrics import classification_report,confusion_matrix
print(classification_report(y_test,predictions))
             precision    recall  f1-score   support

          0       0.98      1.00      0.99       116
          1       1.00      0.96      0.98        55

avg / total       0.99      0.99      0.99       171

In [165]:
print(confusion_matrix(y_test,predictions))
[[116   0]
 [  2  53]]
In [166]:
rfc.score(X_test, y_test)
Out[166]:
0.9883040935672515
In [172]:
lrcl = LogisticRegression(random_state=1)
rfcl = RandomForestClassifier(random_state=1)
nbcl = GaussianNB()
bgcl = BaggingClassifier(base_estimator=cancer_model, n_estimators=90)  # base_estimator may be None, in which case the bagging classifier builds its own decision trees

enclf = VotingClassifier(estimators = [('lor', lrcl), ('rf', rfcl), ('nb', nbcl), ('bg', bgcl)], voting = 'hard')
In [173]:
import warnings
warnings.filterwarnings('ignore')
for clf, label in zip([lrcl , rfcl, nbcl, enclf, bgcl], ['Logistic Regression', 'RandomForest', 'NaiveBayes', 'Ensemble', 'Bagging']):
    scores = cross_val_score(clf, X, Y, cv=5, scoring='accuracy')
    print("Accuracy: %0.02f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label ))
Accuracy: 0.95 (+/- 0.02) [Logistic Regression]
Accuracy: 0.95 (+/- 0.02) [RandomForest]
Accuracy: 0.94 (+/- 0.02) [NaiveBayes]
Accuracy: 0.95 (+/- 0.02) [Ensemble]
Accuracy: 0.95 (+/- 0.02) [Bagging]
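The VotingClassifier above uses voting='hard' (a majority vote on predicted labels). A hedged sketch of the soft-voting variant, which averages predicted probabilities instead, on sklearn's bundled breast-cancer data as a stand-in (the scaling pipeline around logistic regression is an addition so its solver converges, not part of the notebook's setup):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# voting='soft' averages the members' predicted probabilities,
# so a confident member can outweigh two uncertain ones
soft = VotingClassifier(
    estimators=[
        ("lor", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=1)),
        ("nb", GaussianNB()),
    ],
    voting="soft",
)
score = cross_val_score(soft, X, y, cv=5).mean()
print(round(score, 3))
```

Soft voting requires every member to implement predict_proba, which all three of these do.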
In [253]:
y_train = np.ravel(y_train)   # convert the y_train column vector to a 1-D array (avoids the DataConversionWarning seen earlier)
rfcl = RandomForestClassifier(random_state=1)

bgcl = BaggingClassifier(base_estimator=dt_model, n_estimators=20)  # base_estimator may be None,
# in which case the bagging classifier builds its own decision trees

enclf = VotingClassifier(estimators = [('rf', rfcl), ('bg', bgcl)], voting = 'hard')

for clf, label in zip([rfcl, enclf, bgcl], ['RandomForest', 'Ensemble', 'Bagging']):
    clf.fit(X_train, y_train)
    y_predict = clf.predict(X_test)
    print(metrics.classification_report(y_test, y_predict))
             precision    recall  f1-score   support

          0       0.95      0.97      0.96       116
          1       0.94      0.89      0.92        55

avg / total       0.95      0.95      0.95       171

             precision    recall  f1-score   support

          0       0.94      0.99      0.97       116
          1       0.98      0.87      0.92        55

avg / total       0.95      0.95      0.95       171

             precision    recall  f1-score   support

          0       0.95      0.97      0.96       116
          1       0.94      0.89      0.92        55

avg / total       0.95      0.95      0.95       171

As we have seen, both Random Forest and the ensemble techniques performed better than single classifiers. This is because:

Single classifiers tend to have high variance. Combining multiple classifiers reduces that variance, resulting in more stable models. Voting is one of the simplest ways of combining the predictions from multiple machine learning algorithms, and the voting classifier performs at least as well as the individual classifiers. A Random Forest is an ensemble of the same type of model (CART trees), whereas the voting classifier we built combines different types of models, both linear and non-linear. Because Random Forest is a bootstrap-based algorithm, it scales well to large, high-dimensional datasets: it can handle thousands of input variables and rank the most significant ones, which is why it is also used as a feature-selection (dimension-reduction) method.
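The variance-reduction argument can be illustrated with a minimal sketch on synthetic noisy estimates (not the cancer data): averaging k independent estimators shrinks the variance by roughly a factor of k.

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = 0.5

# 1000 trials: each "single classifier" is one noisy estimate of the truth
single = true_value + rng.normal(0, 0.1, size=1000)

# An "ensemble" averages 25 independent noisy estimates per trial
ensemble = (true_value + rng.normal(0, 0.1, size=(1000, 25))).mean(axis=1)

# Averaging leaves the target unchanged but cuts the variance by ~25x
print(single.var(), ensemble.var())
```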

APPLYING PCA

1. Standardize the d-dimensional dataset.

In [176]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from scipy.stats import zscore
In [177]:
df.head(2)
Out[177]:
B-M radius texture perimeter area smoothness compactness concavity concave points Symmetry ... radius-W texture-W perimeter-W area-W smoothness-W compactness-W concavity-W concave points-W Symmetry-W fractal dimension-W
0 1 17.99 10.38 122.8 1001.0 0.11840 0.27760 0.3001 0.14710 0.2419 ... 25.38 17.33 184.6 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890
1 1 20.57 17.77 132.9 1326.0 0.08474 0.07864 0.0869 0.07017 0.1812 ... 24.99 23.41 158.8 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902

2 rows × 31 columns

In [178]:
df.drop('B-M',axis =1,inplace=True)
In [179]:
cancer_data_std = StandardScaler().fit_transform(df)
In [180]:
pd.DataFrame(cancer_data_std).describe().transpose()
Out[180]:
count mean std min 25% 50% 75% max
0 569.0 -1.256562e-16 1.00088 -2.029648 -0.689385 -0.215082 0.469393 3.971288
1 569.0 1.049736e-16 1.00088 -2.229249 -0.725963 -0.104636 0.584176 4.651889
2 569.0 -1.272171e-16 1.00088 -1.984504 -0.691956 -0.235980 0.499677 3.976130
3 569.0 -1.900452e-16 1.00088 -1.454443 -0.667195 -0.295187 0.363507 5.250529
4 569.0 -8.226187e-16 1.00088 -3.112085 -0.710963 -0.034891 0.636199 4.770911
5 569.0 2.419467e-16 1.00088 -1.610136 -0.747086 -0.221940 0.493857 4.568425
6 569.0 -1.315097e-16 1.00088 -1.114873 -0.743748 -0.342240 0.526062 4.243589
7 569.0 -8.780323e-17 1.00088 -1.261820 -0.737944 -0.397721 0.646935 3.927930
8 569.0 1.957036e-16 1.00088 -2.744117 -0.703240 -0.071627 0.530779 4.484751
9 569.0 5.073075e-16 1.00088 -1.819865 -0.722639 -0.178279 0.470983 4.910919
10 569.0 2.588732e-16 1.00088 -1.059924 -0.623571 -0.292245 0.266100 8.906909
11 569.0 -8.887638e-17 1.00088 -1.554264 -0.694809 -0.197498 0.466552 6.655279
12 569.0 -7.516932e-17 1.00088 -1.044049 -0.623768 -0.286652 0.243031 9.461986
13 569.0 -1.088760e-16 1.00088 -0.737829 -0.494754 -0.347783 0.106773 11.041842
14 569.0 -1.588385e-16 1.00088 -1.776065 -0.624018 -0.220335 0.368355 8.029999
15 569.0 2.341419e-16 1.00088 -1.298098 -0.692926 -0.281020 0.389654 6.143482
16 569.0 2.044840e-16 1.00088 -1.057501 -0.557161 -0.199065 0.336752 12.072680
17 569.0 3.707247e-17 1.00088 -1.913447 -0.674490 -0.140496 0.472657 6.649601
18 569.0 1.242903e-16 1.00088 -1.532890 -0.651681 -0.219430 0.355692 7.071917
19 569.0 -4.351138e-17 1.00088 -1.096968 -0.585118 -0.229940 0.288642 9.851593
20 569.0 -7.956924e-16 1.00088 -1.726901 -0.674921 -0.269040 0.522016 4.094189
21 569.0 -1.834112e-17 1.00088 -2.223994 -0.748629 -0.043516 0.658341 3.885905
22 569.0 -4.015534e-16 1.00088 -1.693361 -0.689578 -0.285980 0.540279 4.287337
23 569.0 -2.848727e-17 1.00088 -1.222423 -0.642136 -0.341181 0.357589 5.930172
24 569.0 -2.251665e-16 1.00088 -2.682695 -0.691230 -0.046843 0.597545 3.955374
25 569.0 -2.579464e-16 1.00088 -1.443878 -0.681083 -0.269501 0.539669 5.112877
26 569.0 1.143393e-16 1.00088 -1.305831 -0.756514 -0.218232 0.531141 4.700669
27 569.0 3.203842e-16 1.00088 -1.745063 -0.756400 -0.223469 0.712510 2.685877
28 569.0 1.783381e-16 1.00088 -2.160960 -0.641864 -0.127409 0.450138 6.046041
29 569.0 -6.436952e-16 1.00088 -1.601839 -0.691912 -0.216444 0.450762 6.846856

2. Construct the covariance matrix.

In [181]:
cov_matrix = np.cov(cancer_data_std.T)
print('Covariance Matrix\n%s' % cov_matrix)
Covariance Matrix
[[ 1.00176056e+00  3.24351929e-01  9.99612069e-01  9.89095475e-01
   1.70881506e-01  5.07014640e-01  6.77955036e-01  8.23976636e-01
   1.48001350e-01 -3.12179472e-01  6.80285970e-01 -9.74887767e-02
   6.75358538e-01  7.37159198e-01 -2.22992026e-01  2.06362656e-01
   1.94545531e-01  3.76831225e-01 -1.04504545e-01 -4.27163418e-02
   9.71245907e-01  2.97530545e-01  9.66835698e-01  9.42739295e-01
   1.19826732e-01  4.14190751e-01  5.27839123e-01  7.45524434e-01
   1.64241985e-01  7.07832563e-03]
 [ 3.24351929e-01  1.00176056e+00  3.30113223e-01  3.21650988e-01
  -2.34296930e-02  2.37118951e-01  3.02950254e-01  2.93980713e-01
   7.15266864e-02 -7.65717560e-02  2.76354360e-01  3.87037830e-01
   2.82169018e-01  2.60302460e-01  6.62542133e-03  1.92312595e-01
   1.43545353e-01  1.64139495e-01  9.14323671e-03  5.45533955e-02
   3.53193674e-01  9.13650301e-01  3.58669926e-01  3.44150782e-01
   7.76398084e-02  2.78318729e-01  3.01555198e-01  2.95835766e-01
   1.05192783e-01  1.19415220e-01]
 [ 9.99612069e-01  3.30113223e-01  1.00176056e+00  9.88243612e-01
   2.07643090e-01  5.57916732e-01  7.17396452e-01  8.52475240e-01
   1.83349443e-01 -2.61937255e-01  6.92982910e-01 -8.69138267e-02
   6.94355197e-01  7.46294283e-01 -2.03050882e-01  2.51185131e-01
   2.28483899e-01  4.07933847e-01 -8.17730406e-02 -5.53311534e-03
   9.71183188e-01  3.03571890e-01  9.72095315e-01  9.43207466e-01
   1.50814456e-01  4.56576647e-01  5.64872009e-01  7.72598608e-01
   1.89447989e-01  5.11083511e-02]
 [ 9.89095475e-01  3.21650988e-01  9.88243612e-01  1.00176056e+00
   1.77340047e-01  4.99379326e-01  6.87190545e-01  8.24718286e-01
   1.51559440e-01 -2.83608244e-01  7.33851949e-01 -6.63969041e-02
   7.27907603e-01  8.01494523e-01 -1.67070287e-01  2.12956816e-01
   2.08025659e-01  3.72975776e-01 -7.26242231e-02 -1.99219755e-02
   9.64441062e-01  2.87994769e-01  9.60808165e-01  9.60902082e-01
   1.23740409e-01  3.91097651e-01  5.13508396e-01  7.23287782e-01
   1.43822678e-01  3.74417763e-03]
 [ 1.70881506e-01 -2.34296930e-02  2.07643090e-01  1.77340047e-01
   1.00176056e+00  6.60283643e-01  5.22902753e-01  5.54669988e-01
   5.58756786e-01  5.85821565e-01  3.01997850e-01  6.85268821e-02
   2.96613222e-01  2.46986503e-01  3.32960611e-01  3.19504817e-01
   2.48832996e-01  3.81345895e-01  2.01127852e-01  2.84106006e-01
   2.13495353e-01  3.61353055e-02  2.39273141e-01  2.07082304e-01
   8.06742020e-01  4.73300254e-01  4.35691429e-01  5.03939011e-01
   3.95003689e-01  5.00195447e-01]
 [ 5.07014640e-01  2.37118951e-01  5.57916732e-01  4.99379326e-01
   6.60283643e-01  1.00176056e+00  8.84675460e-01  8.32598309e-01
   6.03702036e-01  5.66364031e-01  4.98349280e-01  4.62861772e-02
   5.49871647e-01  4.56455058e-01  1.35537471e-01  7.40022356e-01
   5.71521303e-01  6.43392594e-01  2.30381479e-01  5.08211293e-01
   5.36257855e-01  2.48569687e-01  5.91249531e-01  5.10500995e-01
   5.66536837e-01  8.67333351e-01  8.17712354e-01  8.17009092e-01
   5.11121711e-01  6.88592503e-01]
 [ 6.77955036e-01  3.02950254e-01  7.17396452e-01  6.87190545e-01
   5.22902753e-01  8.84675460e-01  1.00176056e+00  9.23013194e-01
   5.01548072e-01  3.37376288e-01  6.33037366e-01  7.63525354e-02
   6.61553447e-01  6.18513825e-01  9.87372735e-02  6.71458893e-01
   6.92487233e-01  6.84462839e-01  1.78322604e-01  4.50091771e-01
   6.89448091e-01  3.00406844e-01  7.30849362e-01  6.77177350e-01
   4.49612218e-01  7.56297185e-01  8.85659158e-01  8.62839447e-01
   4.10185014e-01  5.15836457e-01]
 [ 8.23976636e-01  2.93980713e-01  8.52475240e-01  8.24718286e-01
   5.54669988e-01  8.32598309e-01  9.23013194e-01  1.00176056e+00
   4.63311644e-01  1.67211252e-01  6.99278795e-01  2.15173981e-02
   7.11901016e-01  6.91513854e-01  2.77019938e-02  4.91287673e-01
   4.39940250e-01  6.16717994e-01  9.55186580e-02  2.58037239e-01
   8.31779458e-01  2.93267121e-01  8.57430035e-01  8.11055024e-01
   4.53550155e-01  6.68628771e-01  7.53724145e-01  9.11757700e-01
   3.76405667e-01  3.69310185e-01]
 [ 1.48001350e-01  7.15266864e-02  1.83349443e-01  1.51559440e-01
   5.58756786e-01  6.03702036e-01  5.01548072e-01  4.63311644e-01
   1.00176056e+00  4.80766262e-01  3.03913382e-01  1.28278372e-01
   3.14445389e-01  2.24364533e-01  1.87650956e-01  4.22401505e-01
   3.43230240e-01  3.93990298e-01  4.49927276e-01  3.32370277e-01
   1.86054739e-01  9.08102844e-02  2.19554419e-01  1.77505338e-01
   4.27426215e-01  4.74033112e-01  4.34484601e-01  4.31054176e-01
   7.01057885e-01  4.39185353e-01]
 [-3.12179472e-01 -7.65717560e-02 -2.61937255e-01 -2.83608244e-01
   5.85821565e-01  5.66364031e-01  3.37376288e-01  1.67211252e-01
   4.80766262e-01  1.00176056e+00  1.11190486e-04  1.64463005e-01
   3.99000547e-02 -9.03289980e-02  4.02672109e-01  5.60822319e-01
   4.47416643e-01  3.41798745e-01  3.45614805e-01  6.89343077e-01
  -2.54138135e-01 -5.13594647e-02 -2.05512393e-01 -2.32262646e-01
   5.05831058e-01  4.59605900e-01  3.46843443e-01  1.75634121e-01
   3.34606745e-01  7.68647654e-01]
 [ 6.80285970e-01  2.76354360e-01  6.92982910e-01  7.33851949e-01
   3.01997850e-01  4.98349280e-01  6.33037366e-01  6.99278795e-01
   3.03913382e-01  1.11190486e-04  1.00176056e+00  2.13622773e-01
   9.74506342e-01  9.53505869e-01  1.64803858e-01  3.56691450e-01
   3.32942674e-01  5.14250220e-01  2.40990897e-01  2.28154507e-01
   7.16324113e-01  1.95141512e-01  7.20950853e-01  7.52871625e-01
   1.42168410e-01  2.87608629e-01  3.81254678e-01  5.31997297e-01
   9.47092790e-02  4.96466850e-02]
 [-9.74887767e-02  3.87037830e-01 -8.69138267e-02 -6.63969041e-02
   6.85268821e-02  4.62861772e-02  7.63525354e-02  2.15173981e-02
   1.28278372e-01  1.64463005e-01  2.13622773e-01  1.00176056e+00
   2.23563635e-01  1.11763668e-01  3.97942224e-01  2.32107621e-01
   1.95341772e-01  2.30688828e-01  4.12345364e-01  2.80215217e-01
  -1.11886951e-01  4.09722842e-01 -1.02421925e-01 -8.33414586e-02
  -7.37873381e-02 -9.26020990e-02 -6.90776223e-02 -1.19848153e-01
  -1.28440488e-01 -4.57349464e-02]
 [ 6.75358538e-01  2.82169018e-01  6.94355197e-01  7.27907603e-01
   2.96613222e-01  5.49871647e-01  6.61553447e-01  7.11901016e-01
   3.14445389e-01  3.99000547e-02  9.74506342e-01  2.23563635e-01
   1.00176056e+00  9.39306209e-01  1.51341309e-01  4.17055330e-01
   3.63119754e-01  5.57243422e-01  2.66956259e-01  2.44572602e-01
   6.98428059e-01  2.00723620e-01  7.22300731e-01  7.31999440e-01
   1.30283361e-01  3.42521416e-01  4.19636314e-01  5.55874162e-01
   1.10123974e-01  8.55829815e-02]
 [ 7.37159198e-01  2.60302460e-01  7.46294283e-01  8.01494523e-01
   2.46986503e-01  4.56455058e-01  6.18513825e-01  6.91513854e-01
   2.24364533e-01 -9.03289980e-02  9.53505869e-01  1.11763668e-01
   9.39306209e-01  1.00176056e+00  7.52826451e-02  2.85341536e-01
   2.71371654e-01  4.16461487e-01  1.34345087e-01  1.27294619e-01
   7.58706592e-01  1.96842594e-01  7.62552799e-01  8.12836496e-01
   1.25610187e-01  2.83755229e-01  3.85778129e-01  5.39113790e-01
   7.42567956e-02  1.75701742e-02]
 [-2.22992026e-01  6.62542133e-03 -2.03050882e-01 -1.67070287e-01
   3.32960611e-01  1.35537471e-01  9.87372735e-02  2.77019938e-02
   1.87650956e-01  4.02672109e-01  1.64803858e-01  3.97942224e-01
   1.51341309e-01  7.52826451e-02  1.00176056e+00  3.37288855e-01
   2.69157796e-01  3.29007720e-01  4.14234129e-01  4.28126626e-01
  -2.31096855e-01 -7.48745546e-02 -2.17686332e-01 -1.82516245e-01
   3.15011078e-01 -5.56559523e-02 -5.84010247e-02 -1.02186386e-01
  -1.07531080e-01  1.01658978e-01]
 [ 2.06362656e-01  1.92312595e-01  2.51185131e-01  2.12956816e-01
   3.19504817e-01  7.40022356e-01  6.71458893e-01  4.91287673e-01
   4.22401505e-01  5.60822319e-01  3.56691450e-01  2.32107621e-01
   4.17055330e-01  2.85341536e-01  3.37288855e-01  1.00176056e+00
   8.02679026e-01  7.45392672e-01  3.95407752e-01  8.04683023e-01
   2.04967390e-01  1.43254348e-01  2.60974494e-01  1.99722335e-01
   2.27794574e-01  6.79975390e-01  6.40271956e-01  4.84059046e-01
   2.78367653e-01  5.92013208e-01]
 [ 1.94545531e-01  1.43545353e-01  2.28483899e-01  2.08025659e-01
   2.48832996e-01  5.71521303e-01  6.92487233e-01  4.39940250e-01
   3.43230240e-01  4.47416643e-01  3.32942674e-01  1.95341772e-01
   3.63119754e-01  2.71371654e-01  2.69157796e-01  8.02679026e-01
   1.00176056e+00  7.73162805e-01  3.09973347e-01  7.28652769e-01
   1.87232571e-01  1.00417464e-01  2.27079511e-01  1.88684259e-01
   1.68777943e-01  4.85711424e-01  6.63730620e-01  4.41247742e-01
   1.98136040e-01  4.40102736e-01]
 [ 3.76831225e-01  1.64139495e-01  4.07933847e-01  3.72975776e-01
   3.81345895e-01  6.43392594e-01  6.84462839e-01  6.16717994e-01
   3.93990298e-01  3.41798745e-01  5.14250220e-01  2.30688828e-01
   5.57243422e-01  4.16461487e-01  3.29007720e-01  7.45392672e-01
   7.73162805e-01  1.00176056e+00  3.13330893e-01  6.12119921e-01
   3.58757174e-01  8.68939233e-02  3.95694673e-01  3.42873752e-01
   2.15729735e-01  4.53685716e-01  5.50559967e-01  6.03510257e-01
   1.43367633e-01  3.11201479e-01]
 [-1.04504545e-01  9.14323671e-03 -8.17730406e-02 -7.26242231e-02
   2.01127852e-01  2.30381479e-01  1.78322604e-01  9.55186580e-02
   4.49927276e-01  3.45614805e-01  2.40990897e-01  4.12345364e-01
   2.66956259e-01  1.34345087e-01  4.14234129e-01  3.95407752e-01
   3.09973347e-01  3.13330893e-01  1.00176056e+00  3.69727869e-01
  -1.28346334e-01 -7.76098171e-02 -1.03935708e-01 -1.10537008e-01
  -1.26840915e-02  6.03609620e-02  3.71843990e-02 -3.04669411e-02
   3.90088053e-01  7.82169401e-02]
 [-4.27163418e-02  5.45533955e-02 -5.53311534e-03 -1.99219755e-02
   2.84106006e-01  5.08211293e-01  4.50091771e-01  2.58037239e-01
   3.32370277e-01  6.89343077e-01  2.28154507e-01  2.80215217e-01
   2.44572602e-01  1.27294619e-01  4.28126626e-01  8.04683023e-01
   7.28652769e-01  6.12119921e-01  3.69727869e-01  1.00176056e+00
  -3.75536172e-02 -3.20065392e-03 -1.00215889e-03 -2.27761757e-02
   1.70868612e-01  3.90845741e-01  3.80643631e-01  2.15582894e-01
   1.11289544e-01  5.92369136e-01]
 [ 9.71245907e-01  3.53193674e-01  9.71183188e-01  9.64441062e-01
   2.13495353e-01  5.36257855e-01  6.89448091e-01  8.31779458e-01
   1.86054739e-01 -2.54138135e-01  7.16324113e-01 -1.11886951e-01
   6.98428059e-01  7.58706592e-01 -2.31096855e-01  2.04967390e-01
   1.87232571e-01  3.58757174e-01 -1.28346334e-01 -3.75536172e-02
   1.00176056e+00  3.60554418e-01  9.95457402e-01  9.85746984e-01
   2.16955724e-01  4.76657749e-01  5.74985227e-01  7.88810161e-01
   2.43957953e-01  9.36565772e-02]
 [ 2.97530545e-01  9.13650301e-01  3.03571890e-01  2.87994769e-01
   3.61353055e-02  2.48569687e-01  3.00406844e-01  2.93267121e-01
   9.08102844e-02 -5.13594647e-02  1.95141512e-01  4.09722842e-01
   2.00723620e-01  1.96842594e-01 -7.48745546e-02  1.43254348e-01
   1.00417464e-01  8.68939233e-02 -7.76098171e-02 -3.20065392e-03
   3.60554418e-01  1.00176056e+00  3.65741024e-01  3.46451160e-01
   2.25826298e-01  3.61467607e-01  3.69014138e-01  3.60387980e-01
   2.33437721e-01  2.19508204e-01]
 [ 9.66835698e-01  3.58669926e-01  9.72095315e-01  9.60808165e-01
   2.39273141e-01  5.91249531e-01  7.30849362e-01  8.57430035e-01
   2.19554419e-01 -2.05512393e-01  7.20950853e-01 -1.02421925e-01
   7.22300731e-01  7.62552799e-01 -2.17686332e-01  2.60974494e-01
   2.27079511e-01  3.95694673e-01 -1.03935708e-01 -1.00215889e-03
   9.95457402e-01  3.65741024e-01  1.00176056e+00  9.79299180e-01
   2.37191461e-01  5.30339746e-01  6.19432713e-01  8.17759288e-01
   2.69967228e-01  1.39201504e-01]
 [ 9.42739295e-01  3.44150782e-01  9.43207466e-01  9.60902082e-01
   2.07082304e-01  5.10500995e-01  6.77177350e-01  8.11055024e-01
   1.77505338e-01 -2.32262646e-01  7.52871625e-01 -8.33414586e-02
   7.31999440e-01  8.12836496e-01 -1.82516245e-01  1.99722335e-01
   1.88684259e-01  3.42873752e-01 -1.10537008e-01 -2.27761757e-02
   9.85746984e-01  3.46451160e-01  9.79299180e-01  1.00176056e+00
   2.09513547e-01  4.39067932e-01  5.44287093e-01  7.48734680e-01
   2.09513722e-01  7.97872577e-02]
 [ 1.19826732e-01  7.76398084e-02  1.50814456e-01  1.23740409e-01
   8.06742020e-01  5.66536837e-01  4.49612218e-01  4.53550155e-01
   4.27426215e-01  5.05831058e-01  1.42168410e-01 -7.37873381e-02
   1.30283361e-01  1.25610187e-01  3.15011078e-01  2.27794574e-01
   1.68777943e-01  2.15729735e-01 -1.26840915e-02  1.70868612e-01
   2.16955724e-01  2.25826298e-01  2.37191461e-01  2.09513547e-01
   1.00176056e+00  5.69186845e-01  5.19436186e-01  5.48655147e-01
   4.94707764e-01  6.18711558e-01]
 [ 4.14190751e-01  2.78318729e-01  4.56576647e-01  3.91097651e-01
   4.73300254e-01  8.67333351e-01  7.56297185e-01  6.68628771e-01
   4.74033112e-01  4.59605900e-01  2.87608629e-01 -9.26020990e-02
   3.42521416e-01  2.83755229e-01 -5.56559523e-02  6.79975390e-01
   4.85711424e-01  4.53685716e-01  6.03609620e-02  3.90845741e-01
   4.76657749e-01  3.61467607e-01  5.30339746e-01  4.39067932e-01
   5.69186845e-01  1.00176056e+00  8.93831781e-01  8.02490717e-01
   6.15522263e-01  8.11881713e-01]
 [ 5.27839123e-01  3.01555198e-01  5.64872009e-01  5.13508396e-01
   4.35691429e-01  8.17712354e-01  8.85659158e-01  7.53724145e-01
   4.34484601e-01  3.46843443e-01  3.81254678e-01 -6.90776223e-02
   4.19636314e-01  3.85778129e-01 -5.84010247e-02  6.40271956e-01
   6.63730620e-01  5.50559967e-01  3.71843990e-02  3.80643631e-01
   5.74985227e-01  3.69014138e-01  6.19432713e-01  5.44287093e-01
   5.19436186e-01  8.93831781e-01  1.00176056e+00  8.56939906e-01
   5.33457264e-01  6.87719567e-01]
 [ 7.45524434e-01  2.95835766e-01  7.72598608e-01  7.23287782e-01
   5.03939011e-01  8.17009092e-01  8.62839447e-01  9.11757700e-01
   4.31054176e-01  1.75634121e-01  5.31997297e-01 -1.19848153e-01
   5.55874162e-01  5.39113790e-01 -1.02186386e-01  4.84059046e-01
   4.41247742e-01  6.03510257e-01 -3.04669411e-02  2.15582894e-01
   7.88810161e-01  3.60387980e-01  8.17759288e-01  7.48734680e-01
   5.48655147e-01  8.02490717e-01  8.56939906e-01  1.00176056e+00
   5.03413227e-01  5.12013995e-01]
 [ 1.64241985e-01  1.05192783e-01  1.89447989e-01  1.43822678e-01
   3.95003689e-01  5.11121711e-01  4.10185014e-01  3.76405667e-01
   7.01057885e-01  3.34606745e-01  9.47092790e-02 -1.28440488e-01
   1.10123974e-01  7.42567956e-02 -1.07531080e-01  2.78367653e-01
   1.98136040e-01  1.43367633e-01  3.90088053e-01  1.11289544e-01
   2.43957953e-01  2.33437721e-01  2.69967228e-01  2.09513722e-01
   4.94707764e-01  6.15522263e-01  5.33457264e-01  5.03413227e-01
   1.00176056e+00  5.38795122e-01]
 [ 7.07832563e-03  1.19415220e-01  5.11083511e-02  3.74417763e-03
   5.00195447e-01  6.88592503e-01  5.15836457e-01  3.69310185e-01
   4.39185353e-01  7.68647654e-01  4.96466850e-02 -4.57349464e-02
   8.55829815e-02  1.75701742e-02  1.01658978e-01  5.92013208e-01
   4.40102736e-01  3.11201479e-01  7.82169401e-02  5.92369136e-01
   9.36565772e-02  2.19508204e-01  1.39201504e-01  7.97872577e-02
   6.18711558e-01  8.11881713e-01  6.87719567e-01  5.12013995e-01
   5.38795122e-01  1.00176056e+00]]
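Note that the diagonal entries are 1.00176056 rather than exactly 1: StandardScaler divides by the population standard deviation (ddof=0), while np.cov defaults to the sample estimator (ddof=1), so each diagonal entry equals n/(n-1) = 569/568. A toy check on random data (not the cancer dataset):

```python
import numpy as np

n = 569  # number of rows in the dataset
rng = np.random.default_rng(0)
x = rng.normal(size=(n, 3))

# Standardize with the population std (ddof=0), as StandardScaler does
z = (x - x.mean(axis=0)) / x.std(axis=0)

cov = np.cov(z.T)  # np.cov uses ddof=1 by default
print(cov[0, 0])   # n/(n-1) = 569/568 ≈ 1.00176056, matching the diagonal above
```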

3. Decompose the covariance matrix into its eigenvectors and eigenvalues.

In [182]:
eig_vals, eig_vecs = np.linalg.eig(cov_matrix)
print('Eigen Vectors\n%s' % eig_vecs)
print('\nEigen Values\n%s' % eig_vals)
Eigen Vectors
[[ 2.18902444e-01 -2.33857132e-01 -8.53124284e-03  4.14089623e-02
  -3.77863538e-02  1.87407904e-02  1.24088340e-01  7.45229622e-03
  -2.23109764e-01  9.54864432e-02  4.14714866e-02  5.10674568e-02
   1.19672116e-02 -5.95061348e-02  5.11187749e-02 -1.50583883e-01
   2.02924255e-01  1.46712338e-01 -2.25384659e-01 -7.02414091e-01
   2.11460455e-01 -2.11194013e-01 -1.31526670e-01  1.29476396e-01
   1.92264989e-02 -1.82579441e-01  9.85526942e-02 -7.29289034e-02
  -4.96986642e-02  6.85700057e-02]
 [ 1.03724578e-01 -5.97060883e-02  6.45499033e-02 -6.03050001e-01
   4.94688505e-02 -3.21788366e-02 -1.13995382e-02 -1.30674825e-01
   1.12699390e-01  2.40934066e-01 -3.02243402e-01  2.54896423e-01
   2.03461333e-01  2.15600995e-02  1.07922421e-01 -1.57841960e-01
  -3.87061187e-02 -4.11029851e-02 -2.97886446e-02 -2.73661018e-04
  -1.05339342e-02  6.58114593e-05 -1.73573093e-02  2.45566636e-02
  -8.47459309e-02  9.87867898e-02  5.54997454e-04 -9.48006326e-02
  -2.44134993e-01 -4.48369467e-01]
 [ 2.27537293e-01 -2.15181361e-01 -9.31421972e-03  4.19830991e-02
  -3.73746632e-02  1.73084449e-02  1.14477057e-01  1.86872582e-02
  -2.23739213e-01  8.63856150e-02  1.67826374e-02  3.89261058e-02
   4.41095034e-02 -4.85138123e-02  3.99029358e-02 -1.14453955e-01
   1.94821310e-01  1.58317455e-01 -2.39595276e-01  6.89896968e-01
   3.83826098e-01 -8.43382663e-02 -1.15415423e-01  1.25255946e-01
  -2.70154137e-02 -1.16648876e-01  4.02447050e-02 -7.51604777e-02
  -1.76650122e-02  6.97690429e-02]
 [ 2.20994985e-01 -2.31076711e-01  2.86995259e-02  5.34337955e-02
  -1.03312514e-02 -1.88774796e-03  5.16534275e-02 -3.46736038e-02
  -1.95586014e-01  7.49564886e-02  1.10169643e-01  6.54375082e-02
   6.73757374e-02 -1.08308292e-02 -1.39669069e-02 -1.32448032e-01
   2.55705763e-01  2.66168105e-01  2.73221894e-02  3.29473482e-02
  -4.22794920e-01  2.72508323e-01  4.66612477e-01 -3.62727403e-01
   2.10040780e-01  6.98483369e-02 -7.77727342e-03 -9.75657781e-02
  -9.01437617e-02  1.84432785e-02]
 [ 1.42589694e-01  1.86113023e-01 -1.04291904e-01  1.59382765e-01
   3.65088528e-01 -2.86374497e-01  1.40668993e-01  2.88974575e-01
   6.42472194e-03 -6.92926813e-02 -1.37021842e-01  3.16727211e-01
   4.55736020e-02 -4.45064860e-01  1.18143364e-01 -2.04613247e-01
   1.67929914e-01 -3.52226802e-01  1.64565843e-01  4.84745766e-03
  -3.43466700e-03 -1.47926883e-03  6.96899233e-02  3.70036864e-02
  -2.89548850e-02  6.86974224e-02  2.06657211e-02 -6.38229479e-02
   1.71009601e-02  1.19491747e-01]
 [ 2.39285354e-01  1.51891610e-01 -7.40915709e-02  3.17945811e-02
  -1.17039713e-02 -1.41309489e-02 -3.09184960e-02  1.51396350e-01
  -1.67841425e-01  1.29362000e-02 -3.08009633e-01 -1.04017044e-01
   2.29281304e-01 -8.10105720e-03 -2.30899962e-01  1.70178367e-01
  -2.03077075e-02  7.79413843e-03 -2.84222358e-01 -4.46741863e-02
  -4.10167739e-02  5.46276696e-03  9.77487054e-02 -2.62808474e-01
  -3.96623231e-01 -1.04135518e-01 -5.23603957e-02  9.80775567e-02
   4.88686329e-01 -1.92621396e-01]
 [ 2.58400481e-01  6.01653628e-02  2.73383798e-03  1.91227535e-02
  -8.63754118e-02 -9.34418089e-03  1.07520443e-01  7.28272853e-02
   4.05910064e-02 -1.35602298e-01  1.24190245e-01  6.56534798e-02
   3.87090806e-01  1.89358699e-01  1.28283732e-01  2.69470206e-01
  -1.59835337e-03 -2.69681105e-02 -2.26636013e-03 -2.51386661e-02
  -1.00147876e-02 -4.55386379e-02  3.64808397e-01  5.48876170e-01
   9.69773167e-02  4.47410568e-02 -3.24870378e-01  1.85212003e-01
  -3.33870858e-02 -5.57175335e-03]
 [ 2.60853758e-01 -3.47675005e-02 -2.55635406e-02  6.53359443e-02
   4.38610252e-02 -5.20499505e-02  1.50482214e-01  1.52322414e-01
  -1.11971106e-01  8.05452775e-03 -7.24460264e-02  4.25892667e-02
   1.32138097e-01  2.44794768e-01  2.17099194e-01  3.80464095e-01
   3.45095087e-02 -8.28277367e-02  1.54972363e-01  1.07726530e-03
  -4.20694931e-03  8.88309714e-03 -4.54699351e-01 -3.87643377e-01
   1.86451602e-01  8.40276972e-02  5.14087968e-02  3.11852431e-01
  -2.35407606e-01  9.42381870e-03]
 [ 1.38166959e-01  1.90348770e-01 -4.02399363e-02  6.71249840e-02
   3.05941428e-01  3.56458461e-01  9.38911345e-02  2.31530989e-01
   2.56040084e-01  5.72069479e-01  1.63054081e-01 -2.88865504e-01
   1.89933673e-01 -3.07388563e-02  7.39617071e-02 -1.64661588e-01
  -1.91737848e-01  1.73397790e-01  5.88111647e-02  1.28037941e-03
  -7.56986244e-03 -1.43302642e-03 -1.51648349e-02  1.60440385e-02
   2.45836949e-02  1.93394733e-02  5.12005770e-02  1.84067326e-02
   2.60691555e-02  8.69384844e-02]
 [ 6.43633464e-02  3.66575471e-01 -2.25740897e-02  4.85867649e-02
   4.44243602e-02 -1.19430668e-01 -2.95760024e-01  1.77121441e-01
  -1.23740789e-01  8.11032072e-02 -3.80482687e-02  2.36358988e-01
   1.06239082e-01  3.77078865e-01 -5.17975705e-01 -4.07927860e-02
   5.02252456e-02  8.78673570e-02  5.81570509e-02  4.75568480e-03
   7.30143287e-03  6.31168651e-03 -1.01244946e-01  9.74048386e-02
   2.07221864e-01 -1.33260547e-01  8.46898562e-02 -2.87868885e-01
  -1.75637222e-01  7.62718362e-02]
 [ 2.05978776e-01 -1.05552152e-01  2.68481387e-01  9.79412418e-02
   1.54456496e-01 -2.56032561e-02 -3.12490037e-01 -2.25399674e-02
   2.49985002e-01 -4.95475941e-02 -2.53570194e-02 -1.66879153e-02
  -6.81952298e-02 -1.03474126e-02  1.10050711e-01  5.89057190e-02
  -1.39396866e-01 -2.36216532e-01 -1.75883308e-01  8.71109373e-03
   1.18442112e-01  1.92223890e-01  2.12982901e-01 -4.99770798e-02
   1.74930429e-01 -5.58701567e-01  2.64125317e-01  1.50274681e-01
  -9.08005031e-02 -8.63867747e-02]
 [ 1.74280281e-02  8.99796818e-02  3.74633665e-01 -3.59855528e-01
   1.91650506e-01 -2.87473145e-02  9.07553556e-02  4.75413139e-01
  -2.46645397e-01 -2.89142742e-01  3.44944458e-01 -3.06160423e-01
  -1.68222383e-01  1.08493473e-02 -3.27527212e-02 -3.45004006e-02
   4.39630156e-02 -9.85866201e-03 -3.60098518e-02  1.07103919e-03
  -8.77627920e-03  5.62261069e-03 -1.00928890e-02  1.12372419e-02
  -5.69864778e-02  2.42672970e-02  8.73880467e-04 -4.84569345e-02
  -7.16599878e-02 -2.17071967e-01]
 [ 2.11325916e-01 -8.94572342e-02  2.66645367e-01  8.89924146e-02
   1.20990220e-01  1.81071500e-03 -3.14640390e-01  1.18966905e-02
   2.27154024e-01 -1.14508236e-01 -1.67318771e-01 -1.01446828e-01
  -3.78439858e-02  4.55237175e-02  8.26808881e-03  2.65166513e-02
  -2.46356391e-02 -2.59288003e-02 -3.65701538e-01 -1.37293906e-02
  -6.10021933e-03 -2.63191868e-01  4.16915529e-02 -1.03653282e-01
  -7.29276412e-02  5.16750385e-01 -9.00742110e-02 -1.59352804e-01
  -1.77250625e-01  3.04950158e-01]
 [ 2.02869635e-01 -1.52292628e-01  2.16006528e-01  1.08205039e-01
   1.27574432e-01 -4.28639079e-02 -3.46679003e-01 -8.58051345e-02
   2.29160015e-01 -9.19278886e-02  5.16194632e-02 -1.76792177e-02
   5.60649318e-02 -8.35707181e-02  4.60243656e-02  4.11532265e-02
   3.34418173e-01  3.04906903e-01  4.16572314e-01 -1.10532603e-03
  -8.59259138e-02  4.20681051e-02 -3.13358657e-01  1.55304589e-01
  -1.31850405e-01 -2.24607172e-02 -9.82150746e-02 -6.42326151e-02
   2.74201148e-01 -1.92587786e-01]
 [ 1.45314521e-02  2.04430453e-01  3.08838979e-01  4.46641797e-02
   2.32065676e-01 -3.42917393e-01  2.44024056e-01 -5.73410232e-01
  -1.41924890e-01  1.60884609e-01  8.42062106e-02 -2.94710053e-01
   1.50441434e-01  2.01152530e-01 -1.85594647e-02 -5.80390613e-02
   1.39595006e-01 -2.31259943e-01  1.32600886e-02  1.60821086e-03
   1.77638619e-03 -9.79296328e-03 -9.05215355e-03  7.71755717e-03
  -3.12107028e-02  1.56311888e-02  5.98177179e-02 -5.05449015e-02
   9.00614773e-02  7.20987261e-02]
 [ 1.70393451e-01  2.32715896e-01  1.54779718e-01 -2.74693632e-02
  -2.79968156e-01  6.91975186e-02 -2.34635340e-02 -1.17460157e-01
  -1.45322810e-01  4.35048658e-02 -2.06885680e-01 -2.63456509e-01
   1.00401699e-02 -4.91755932e-01 -1.68209315e-01  1.89830896e-01
  -8.24647717e-03  1.00474235e-01  2.42448176e-01 -1.91562235e-03
   3.15813441e-03  1.53955481e-02  4.65360884e-02  4.97276317e-02
  -1.73164553e-01 -1.21777792e-01 -9.10387102e-03  4.52876920e-02
  -4.61098220e-01  1.40386572e-01]
 [ 1.53589790e-01  1.97207283e-01  1.76463743e-01  1.31687997e-03
  -3.53982091e-01  5.63432386e-02  2.08823790e-01 -6.05665008e-02
   3.58107079e-01 -1.41276243e-01  3.49517943e-01  2.51146975e-01
   1.58783192e-01 -1.34586924e-01 -2.50471408e-01 -1.25420649e-01
   8.46167156e-02 -1.95485228e-04 -1.26381025e-01  8.92652653e-03
   1.60785207e-02 -5.82097800e-03 -8.42247975e-02 -9.14549680e-02
  -1.59399802e-02  1.88205036e-01  3.87542329e-01  2.05212693e-01
   6.69461742e-02 -6.30479298e-02]
 [ 1.83417397e-01  1.30321560e-01  2.24657567e-01  7.40673350e-02
  -1.95548089e-01 -3.12244482e-02  3.69645937e-01  1.08319309e-01
   2.72519886e-01  8.62408470e-02 -3.42375908e-01 -6.45875122e-03
  -4.94026741e-01  1.99666719e-01 -6.20793442e-02 -1.98810346e-01
   1.08132263e-01  4.60549116e-02  1.21642969e-02  2.16019727e-03
  -2.39377870e-02  2.90093001e-02 -1.11655093e-02  1.79419192e-02
   1.29546547e-01 -1.09668978e-01 -3.51755074e-01  7.25453753e-02
   6.88682942e-02 -3.43753236e-02]
 [ 4.24984216e-02  1.83848000e-01  2.88584292e-01  4.40733510e-02
   2.52868765e-01  4.90245643e-01  8.03822539e-02 -2.20149279e-01
  -3.04077200e-01 -3.16529830e-01 -1.87844043e-01  3.20571348e-01
   1.03327412e-02  4.68643826e-02  1.13383199e-01 -1.57711497e-01
  -2.74059129e-01  1.87014764e-01  8.90392949e-02 -3.29389752e-04
  -5.22329189e-03  7.63652550e-03 -1.99759830e-02  1.72678486e-02
   1.95149333e-02  3.22620011e-03  4.23628949e-02  8.46544307e-02
   1.07385289e-01  9.76995265e-02]
 [ 1.02568322e-01  2.80092027e-01  2.11503764e-01  1.53047496e-02
  -2.63297438e-01 -5.31952674e-02 -1.91394973e-01 -1.11681884e-02
  -2.13722716e-01  3.67541918e-01  2.50624789e-01  2.76165974e-01
  -2.40458323e-01 -1.45652466e-01  3.53232211e-01  2.68553878e-01
  -1.22733398e-01 -5.98230982e-02 -8.66008430e-02 -1.79895682e-03
  -8.34191154e-03 -1.97564555e-02 -1.20365640e-02 -3.54889745e-02
   8.41712034e-02  7.51944193e-02 -8.57810992e-02 -2.44705083e-01
   2.22345297e-01 -6.28432814e-02]
 [ 2.27996634e-01 -2.19866379e-01 -4.75069900e-02  1.54172396e-02
   4.40659209e-03 -2.90684919e-04  9.70993602e-03 -4.26194163e-02
  -1.12141463e-01  7.73616428e-02  1.05067333e-01  3.96796652e-02
  -1.37890527e-01 -2.31012813e-02 -1.66567074e-01 -8.15605686e-02
  -2.40049982e-01 -2.16101353e-01 -1.36613039e-02  1.35643056e-01
  -6.35724917e-01 -4.12639581e-01 -1.78666740e-01  1.97054744e-01
  -7.07097238e-02 -1.56830365e-01  5.56767923e-02  9.62982088e-02
  -5.62690874e-03 -7.29389953e-03]
 [ 1.04469325e-01 -4.54672983e-02 -4.22978228e-02 -6.32807885e-01
   9.28834001e-02 -5.00080613e-02 -9.87074388e-03 -3.62516360e-02
   1.03341204e-01  2.95509413e-02  1.31572736e-02  7.97974499e-02
  -8.01454315e-02 -5.34307917e-02 -1.01115399e-01  1.85557852e-01
   6.93651855e-02  5.83984505e-02  7.58669276e-02 -1.02053601e-03
   1.72354925e-02  3.90250926e-04  2.14106944e-02 -3.64694332e-02
   1.18189721e-01 -1.18484602e-01  8.92289971e-03  1.11112024e-01
   3.00599798e-01  5.94440143e-01]
 [ 2.36639681e-01 -1.99878428e-01 -4.85465083e-02  1.38027944e-02
  -7.45415100e-03  8.50098715e-03  4.45726717e-04 -3.05585340e-02
  -1.09614364e-01  5.05083335e-02  5.10762807e-02 -8.98773800e-03
  -9.69657077e-02 -1.22193824e-02 -1.82755198e-01 -5.48570473e-02
  -2.34164147e-01 -1.88543592e-01 -9.08132490e-02 -7.97438536e-02
   2.29218029e-02  7.28680898e-01 -2.41031046e-01  2.44103670e-01
  -1.18034029e-01  2.37113167e-01 -6.33448296e-02 -1.72216251e-02
   1.10038577e-02  9.20235990e-02]
 [ 2.24870533e-01 -2.19351858e-01 -1.19023182e-02  2.58947492e-02
   2.73909030e-02 -2.51643821e-02 -6.78316595e-02 -7.93942456e-02
  -8.07324609e-02  6.99211523e-02  1.84598937e-01  4.80886567e-02
  -1.01160611e-01  6.68546458e-03 -3.14993600e-01 -9.06533944e-02
  -2.73399584e-01 -1.42064856e-01  4.10047202e-01 -3.97422838e-02
   4.44935933e-01 -2.38960316e-01  2.37162466e-01 -2.31359525e-01
   3.82899511e-02  1.44063033e-01 -1.90889625e-01  9.69598236e-02
   6.00473870e-02 -1.46790132e-01]
 [ 1.27952561e-01  1.72304352e-01 -2.59797613e-01  1.76522161e-02
   3.24435445e-01 -3.69255370e-01  1.08830886e-01 -2.05852191e-01
   1.12315904e-01 -1.28304659e-01  1.43890349e-01  5.65148662e-02
  -2.05130344e-01 -1.62235443e-01 -4.61258656e-02  1.45551659e-01
  -2.78030197e-01  5.01551675e-01 -2.34513845e-01 -4.58327731e-03
   7.38549171e-03  1.53524821e-03 -4.08535683e-02 -1.26024637e-02
   4.79647647e-02 -1.09901386e-02 -9.36901494e-02  6.82540931e-02
  -1.29723903e-01 -1.64849237e-01]
 [ 2.10095880e-01  1.43593173e-01 -2.36075625e-01 -9.13284153e-02
  -1.21804107e-01  4.77057929e-02 -1.40472938e-01 -8.40196588e-02
  -1.00677822e-01 -1.72133632e-01 -1.97420469e-01 -3.71662503e-01
   1.22793095e-02 -1.66470250e-01  4.99560142e-02 -1.53734861e-01
  -4.03712272e-03 -7.35745143e-02 -2.02007041e-02  1.28415624e-02
   3.56690393e-06 -4.86918180e-02 -7.05054136e-02  1.00463424e-01
   6.24384938e-01  1.86749953e-01  1.47920925e-01 -2.96764124e-02
   2.29280589e-01 -1.81374867e-01]
 [ 2.28767533e-01  9.79641143e-02 -1.73057335e-01 -7.39511797e-02
  -1.88518727e-01  2.83792555e-02  6.04880561e-02 -7.24678714e-02
   1.61908621e-01 -3.11638520e-01  1.85016760e-01 -8.70345324e-02
   2.17984329e-01  6.67989309e-02  2.04835886e-01 -2.15021948e-01
  -1.91313419e-01 -1.03907980e-01  4.57861197e-02 -4.02139168e-04
  -1.26757226e-02  1.76408967e-02 -1.42905801e-01 -2.66853781e-01
  -1.15770341e-01 -2.88852570e-01 -2.86433135e-01 -4.60426186e-01
  -4.64827918e-02  1.32100595e-01]
 [ 2.50885971e-01 -8.25723507e-03 -1.70344076e-01  6.00699571e-03
  -4.33320687e-02 -3.08734498e-02  1.67966619e-01  3.61707954e-02
   6.04884615e-02 -7.66482910e-02 -1.17772055e-01 -6.81253543e-02
  -2.54387490e-01  2.76418891e-01  1.69499607e-01  1.78141741e-01
  -7.54853164e-02  7.58138963e-02  2.60229625e-01  2.28844179e-03
   3.52404543e-02 -2.24756680e-02  2.30901389e-01  1.33574507e-01
  -2.63196337e-01  1.07340243e-01  5.67527797e-01 -2.99840557e-01
   3.30223397e-02 -8.86081478e-04]
 [ 1.22904556e-01  1.41883349e-01 -2.71312642e-01 -3.62506947e-02
   2.44558663e-01  4.98926784e-01  1.84906298e-02 -2.28225053e-01
   6.46378061e-02 -2.95630751e-02  1.57560248e-01  4.40335026e-02
  -2.56534905e-01 -5.35557351e-03 -1.39888394e-01  2.57894009e-01
   4.30658116e-01 -2.78713843e-01 -1.17250532e-01 -3.95443454e-04
   1.34042283e-02 -4.92048082e-03  2.27904438e-02 -2.81842956e-02
  -4.52996243e-02 -1.43818093e-02 -1.21343451e-01 -9.71448437e-02
  -1.16759236e-01 -1.62708549e-01]
 [ 1.31783943e-01  2.75339469e-01 -2.32791313e-01 -7.70534703e-02
  -9.44233510e-02 -8.02235245e-02 -3.74657626e-01 -4.83606666e-02
  -1.34174175e-01  1.26095791e-02  1.18283551e-01 -3.47316933e-02
  -1.72814238e-01  2.12104110e-01  2.56173195e-01 -4.05556492e-01
   1.59394300e-01  2.35647497e-02  1.14944811e-02 -1.89429245e-03
   1.14776603e-02  2.35621424e-02  5.99859979e-02 -4.52048188e-03
  -2.80133485e-01  3.78254532e-02 -7.62533821e-03  4.69471147e-01
  -1.04991974e-01  9.23439434e-02]]

 Eigen Values 
 [1.33049908e+01 5.70137460e+00 2.82291016e+00 1.98412752e+00
 1.65163324e+00 1.20948224e+00 6.76408882e-01 4.77456255e-01
 4.17628782e-01 3.51310875e-01 2.94433153e-01 2.61621161e-01
 2.41782421e-01 1.57286149e-01 9.43006956e-02 8.00034045e-02
 5.95036135e-02 5.27114222e-02 4.95647002e-02 1.33279057e-04
 7.50121413e-04 1.59213600e-03 6.91261258e-03 8.19203712e-03
 1.55085271e-02 1.80867940e-02 2.43836914e-02 2.74877113e-02
 3.12142606e-02 3.00256631e-02]
In [183]:
# Display the eigen values
print("Eigen Values:")
pd.DataFrame(eig_vals).transpose()
Eigen Values:
Out[183]:
0 1 2 3 4 5 6 7 8 9 ... 20 21 22 23 24 25 26 27 28 29
0 13.304991 5.701375 2.82291 1.984128 1.651633 1.209482 0.676409 0.477456 0.417629 0.351311 ... 0.00075 0.001592 0.006913 0.008192 0.015509 0.018087 0.024384 0.027488 0.031214 0.030026

1 rows × 30 columns

Selecting k eigenvectors

4. Select the k eigenvectors that correspond to the k largest eigenvalues, where k is the dimensionality of the new feature subspace (k ≤ d).

In [184]:
tot = sum(eig_vals)
var_exp = [( i /tot ) * 100 for i in sorted(eig_vals, reverse=True)]
cum_var_exp = np.cumsum(var_exp)
print("Cumulative Variance Explained", cum_var_exp)
Cumulative Variance Explained [ 44.27202561  63.24320765  72.63637091  79.23850582  84.73427432
  88.75879636  91.00953007  92.59825387  93.98790324  95.15688143
  96.13660042  97.00713832  97.81166331  98.33502905  98.64881227
  98.91502161  99.1130184   99.28841435  99.45333965  99.55720433
  99.65711397  99.74857865  99.82971477  99.88989813  99.94150237
  99.96876117  99.99176271  99.99706051  99.99955652 100.        ]
In [185]:
# Plotting the individual and cumulative explained variance
plt.figure(figsize=(10 , 5))
plt.bar(range(1, eig_vals.size + 1), var_exp, alpha = 0.5, align = 'center', label = 'Individual explained variance')
plt.step(range(1, eig_vals.size + 1), cum_var_exp, where='mid', label = 'Cumulative explained variance')
plt.ylabel('Explained Variance Ratio')
plt.xlabel('Principal Components')
plt.legend(loc = 'best')
plt.tight_layout()
plt.show()

The plot indicates that the first principal component alone accounts for around 45 percent of the variance, and that the first two principal components combined explain almost 65 percent of the variance in the data. A common rule of thumb is to keep eigenvectors whose eigenvalues are greater than 1, so a sensible approach is to balance the two criteria: eigenvalues near 1 combined with maximum cumulative variance. We therefore collect the five eigenvectors that correspond to the five largest eigenvalues, which capture about 85 percent of the variance in this dataset. Only five eigenvectors are kept here, for the purpose of dimensionality reduction.
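The threshold-based choice of k can also be made programmatically. A minimal sketch, using a small set of hypothetical eigenvalues as stand-ins for `eig_vals` above:

```python
import numpy as np

# Hypothetical eigenvalues (stand-ins for eig_vals above), sorted descending
eig_sorted = np.array([13.30, 5.70, 2.82, 1.98, 1.65, 1.21, 0.68, 0.48])
var_exp = eig_sorted / eig_sorted.sum() * 100
cum_var_exp = np.cumsum(var_exp)

# Smallest k whose cumulative explained variance reaches 85 percent
k = int(np.searchsorted(cum_var_exp, 85) + 1)
```

`np.searchsorted` finds the first position where the cumulative curve crosses the threshold, so the same one-liner works for any target percentage.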

In [186]:
#Selecting k eigenvectors that correspond to the k largest eigenvalues, where k is the dimensionality of 
#the new feature subspace ( k≤d ).
eigen_pairs = [(np.abs(eig_vals[i]), eig_vecs[ :, i]) for i in range(len(eig_vals))]
In [187]:
# Projection Matrix


matrix_w = np.hstack((eigen_pairs[0][1].reshape(30,1), 
                      eigen_pairs[1][1].reshape(30,1),
                      eigen_pairs[2][1].reshape(30,1),
                      eigen_pairs[3][1].reshape(30,1),
                      eigen_pairs[4][1].reshape(30,1),
                     ))

print('Matrix W:\n', matrix_w)
Matrix W:
 [[ 0.21890244 -0.23385713 -0.00853124  0.04140896 -0.03778635]
 [ 0.10372458 -0.05970609  0.0645499  -0.60305     0.04946885]
 [ 0.22753729 -0.21518136 -0.00931422  0.0419831  -0.03737466]
 [ 0.22099499 -0.23107671  0.02869953  0.0534338  -0.01033125]
 [ 0.14258969  0.18611302 -0.1042919   0.15938277  0.36508853]
 [ 0.23928535  0.15189161 -0.07409157  0.03179458 -0.01170397]
 [ 0.25840048  0.06016536  0.00273384  0.01912275 -0.08637541]
 [ 0.26085376 -0.0347675  -0.02556354  0.06533594  0.04386103]
 [ 0.13816696  0.19034877 -0.04023994  0.06712498  0.30594143]
 [ 0.06436335  0.36657547 -0.02257409  0.04858676  0.04442436]
 [ 0.20597878 -0.10555215  0.26848139  0.09794124  0.1544565 ]
 [ 0.01742803  0.08997968  0.37463367 -0.35985553  0.19165051]
 [ 0.21132592 -0.08945723  0.26664537  0.08899241  0.12099022]
 [ 0.20286964 -0.15229263  0.21600653  0.10820504  0.12757443]
 [ 0.01453145  0.20443045  0.30883898  0.04466418  0.23206568]
 [ 0.17039345  0.2327159   0.15477972 -0.02746936 -0.27996816]
 [ 0.15358979  0.19720728  0.17646374  0.00131688 -0.35398209]
 [ 0.1834174   0.13032156  0.22465757  0.07406733 -0.19554809]
 [ 0.04249842  0.183848    0.28858429  0.04407335  0.25286876]
 [ 0.10256832  0.28009203  0.21150376  0.01530475 -0.26329744]
 [ 0.22799663 -0.21986638 -0.04750699  0.01541724  0.00440659]
 [ 0.10446933 -0.0454673  -0.04229782 -0.63280788  0.0928834 ]
 [ 0.23663968 -0.19987843 -0.04854651  0.01380279 -0.00745415]
 [ 0.22487053 -0.21935186 -0.01190232  0.02589475  0.0273909 ]
 [ 0.12795256  0.17230435 -0.25979761  0.01765222  0.32443545]
 [ 0.21009588  0.14359317 -0.23607563 -0.09132842 -0.12180411]
 [ 0.22876753  0.09796411 -0.17305734 -0.07395118 -0.18851873]
 [ 0.25088597 -0.00825724 -0.17034408  0.006007   -0.04333207]
 [ 0.12290456  0.14188335 -0.27131264 -0.03625069  0.24455866]
 [ 0.13178394  0.27533947 -0.23279131 -0.07705347 -0.09442335]]

Transform the d-dimensional input dataset X using the projection matrix W to obtain the new k-dimensional feature subspace. We use the 30×5 projection matrix W to transform our samples onto the new subspace via the linear combination Y = X×W, where Y is a 569×5 matrix of the transformed samples.

In [188]:
# Projection Onto the New Feature Space

Y = cancer_data_std.dot(matrix_w)
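A quick dimensional check of Y = XW, using zero-filled stand-ins with the same shapes as the data and projection matrix above:

```python
import numpy as np

# Zero-filled stand-ins: only the shapes matter here
X_std = np.zeros((569, 30))   # standardized data, like cancer_data_std
W = np.zeros((30, 5))         # projection matrix, like matrix_w
Y = X_std.dot(W)              # (569, 30) x (30, 5) -> (569, 5)
```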
In [189]:
# scikit-learn already provides a PCA implementation.
# Using it here to visualize the interactions between the top 3 principal components

from mpl_toolkits.mplot3d import Axes3D
from sklearn.decomposition import PCA

fig = plt.figure(1, figsize=(8, 6))
ax = Axes3D(fig, elev=-150, azim=110)
X_reduced = PCA(n_components=3).fit_transform(df)
ax.scatter(X_reduced[:, 0], X_reduced[:, 1], X_reduced[:, 2], c=df.iloc[:,0].values,
           cmap=plt.cm.Set1, edgecolor='k', s=40)
ax.set_title("First three PCA directions")
ax.set_xlabel("1st eigenvector")
ax.w_xaxis.set_ticklabels([])
ax.set_ylabel("2nd eigenvector")
ax.w_yaxis.set_ticklabels([])
ax.set_zlabel("3rd eigenvector")
ax.w_zaxis.set_ticklabels([])

plt.show()

Steps followed to design above PCA

Begin by standardizing the data: subtract the mean along every dimension so the data points are shifted to, i.e. centered on, the origin

Generate the covariance matrix / correlation matrix for all the dimensions

Perform the eigendecomposition, that is, compute the eigenvectors, which are the principal components, and the corresponding eigenvalues, which are the magnitudes of the variance captured

Sort the eigenpairs in descending order of eigenvalue and select the one with the largest value. This first principal component captures the maximum information from the original data
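The manual route above can be cross-checked against scikit-learn's PCA on synthetic data; the two should find the same first axis, up to an arbitrary sign flip:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.normal(size=(80, 5))
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Manual route: covariance -> eigendecomposition -> top eigenvector
cov = np.cov(X_std.T)
vals, vecs = np.linalg.eigh(cov)
manual_pc1 = vecs[:, np.argmax(vals)]

# Library route
sk_pc1 = PCA(n_components=1).fit(X_std).components_[0]

# Both are unit vectors along the same axis, up to sign
cosine = abs(manual_pc1 @ sk_pc1)
```

The sign ambiguity is expected: an eigenvector multiplied by -1 spans the same direction, so only the absolute cosine is meaningful.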

Here I show the correlation plot of ‘worst’ values of the features.

In [202]:
feature_worst=list(df.columns[20:31])# select the 'worst' features
s=sns.heatmap(df[feature_worst].corr(),cmap='coolwarm') # fantastic tool to study the features 
s.set_yticklabels(s.get_yticklabels(),rotation=30,fontsize=7)
s.set_xticklabels(s.get_xticklabels(),rotation=30,fontsize=7)
plt.show()
# radius, perimeter and area are geometrically related, so their high mutual correlation is expected

Developing ML models on PCA

In [214]:
df=pd.read_excel("CaseStudy_Cancer.xls")
df.drop('ID',axis =1,inplace=True)
cleanup_nums = {"B-M":     {"M": 1, "B": 0}}
df.replace(cleanup_nums, inplace=True)
In [215]:
from sklearn.model_selection import train_test_split
X = df.drop(['B-M'], axis=1)
Y = df[['B-M']]
test_size = 0.30
seed = 7  
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)
In [216]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Fit on training set only.
scaler.fit(X_train)
Out[216]:
StandardScaler(copy=True, with_mean=True, with_std=True)
In [217]:
# Apply transform to both the training set and the test set.
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
In [230]:
from sklearn.decomposition import PCA

pca = PCA(n_components=3)
In [231]:
pca.fit(X_train)
Out[231]:
PCA(copy=True, iterated_power='auto', n_components=3, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)
In [232]:
X_train = pca.transform(X_train)
X_test = pca.transform(X_test)
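The scaler-then-PCA-then-classifier sequence above, where every transform is fit on the training fold only, can be packaged in a scikit-learn Pipeline. A sketch on synthetic data (parameter values here are illustrative, not the case study's tuned values):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=7)

# Pipeline fits the scaler and PCA on the training fold only,
# then applies the same fitted transforms to the test fold
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('pca', PCA(n_components=3)),
    ('tree', DecisionTreeClassifier(random_state=7)),
])
pipe.fit(X_tr, y_tr)
acc = pipe.score(X_te, y_te)
```

Bundling the steps this way removes the risk of accidentally calling `fit` on test data.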
In [233]:
cancer_model = DecisionTreeClassifier(criterion = 'entropy', class_weight={0:.80,1:.20}, max_depth = 9, min_samples_leaf=5 )
In [234]:
cancer_model.fit(X_train,y_train)
Out[234]:
DecisionTreeClassifier(class_weight={0: 0.8, 1: 0.2}, criterion='entropy',
            max_depth=9, max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=5, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
In [235]:
cancer_model.score(X_test, y_test)
Out[235]:
0.935672514619883
In [236]:
predictions = cancer_model.predict(X_test)
from sklearn.metrics import classification_report,confusion_matrix
print(classification_report(y_test,predictions))
             precision    recall  f1-score   support

          0       0.93      0.97      0.95       116
          1       0.94      0.85      0.90        55

avg / total       0.94      0.94      0.93       171

In [237]:
print(confusion_matrix(y_test,predictions))
[[113   3]
 [  8  47]]
In [238]:
y_train = np.ravel(y_train)   # converting y_train vector to single dimensional array
rfcl = RandomForestClassifier(random_state=1)

bgcl = BaggingClassifier(base_estimator=cancer_model, n_estimators=20)  # base_estimator may be None;
# in that case the bagging classifier builds its own decision trees

enclf = VotingClassifier(estimators = [('rf', rfcl), ('bg', bgcl)], voting = 'hard')

for clf, label in zip([rfcl, enclf, bgcl], ['RandomForest', 'Ensemble', 'Bagging']):
    clf.fit(X_train, y_train)
    y_predict = clf.predict(X_test)
    print(metrics.classification_report(y_test, y_predict))
             precision    recall  f1-score   support

          0       0.95      0.97      0.96       116
          1       0.94      0.89      0.92        55

avg / total       0.95      0.95      0.95       171

             precision    recall  f1-score   support

          0       0.93      0.98      0.95       116
          1       0.96      0.84      0.89        55

avg / total       0.94      0.94      0.93       171

             precision    recall  f1-score   support

          0       0.92      0.97      0.95       116
          1       0.94      0.82      0.87        55

avg / total       0.92      0.92      0.92       171

In [239]:
rfcl = RandomForestClassifier(criterion = 'entropy', class_weight={0:.5,1:.5}, max_depth = 5, min_samples_leaf=5)
rfcl = rfcl.fit(X_train, y_train)
In [241]:
test_pred = rfcl.predict(X_test)
rfcl.score(X_test , y_test)
Out[241]:
0.935672514619883
In [242]:
rf_model = RandomForestClassifier(n_estimators = 350, criterion = 'entropy')
rf_model = rf_model.fit(X_train, y_train)
test_pred = rf_model.predict(X_test)
rf_model.score(X_test , y_test)
Out[242]:
0.9532163742690059
In [243]:
lrcl = LogisticRegression(random_state=1)
rfcl = RandomForestClassifier(random_state=1)
nbcl = GaussianNB()
bgcl = BaggingClassifier(base_estimator=cancer_model, n_estimators=90)  # base_estimator may be None; the bagging classifier then builds its own trees

enclf = VotingClassifier(estimators = [('lor', lrcl), ('rf', rfcl), ('nb', nbcl), ('bg', bgcl)], voting = 'hard')
In [244]:
for clf, label in zip([lrcl , rfcl, nbcl, enclf, bgcl], ['Logistic Regression', 'RandomForest', 'NaiveBayes', 'Ensemble', 'Bagging']):
    scores = cross_val_score(clf, X, Y, cv=5, scoring='accuracy')
    print("Accuracy: %0.02f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label ))
Accuracy: 0.95 (+/- 0.02) [Logistic Regression]
Accuracy: 0.95 (+/- 0.02) [RandomForest]
Accuracy: 0.94 (+/- 0.02) [NaiveBayes]
Accuracy: 0.95 (+/- 0.02) [Ensemble]
Accuracy: 0.95 (+/- 0.02) [Bagging]
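Note that this cross-validation scores the models on the original unscaled X, not the PCA features. Folding scaling and PCA into the pipeline keeps the comparison leakage-free, since each fold refits the transforms on its own training split. A sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=20, random_state=1)

pipe = make_pipeline(StandardScaler(), PCA(n_components=5),
                     LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=5, scoring='accuracy')
```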

Comparing Models and advantage of PCA

The cancer dataset has 30 features; reducing it to only 3 principal components lets us both simplify the models and visualize a scatter plot of the new independent variables.

The Decision Tree now achieves better results (an improvement of more than 2%). Let's examine it carefully.

The confusion matrix shows really good results this time: the decision tree commits fewer misclassifications in both classes, as can be seen from the values on the main diagonal, and the accuracy is around 95%. This means the classifier has roughly a 95% chance of correctly classifying a new unseen example, which is a solid result for a classification problem.

Consider this example: we have a dataset composed of a set of properties measured from tumors. These properties describe each tumor by its radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, fractal dimension, SE-radius, and so on. However, many of these features measure related properties and are therefore redundant, so we should remove this redundancy and describe each sample with fewer properties. This is exactly what PCA aims to do. Recall that PCA does not take class information into account; it only looks at the variance of each feature, because it is reasonable to assume that features with high variance are more likely to yield a good split between classes.

People often make the mistake of thinking that PCA selects some features out of the dataset and discards the others. The algorithm actually constructs a new set of properties based on combinations of the old ones. Mathematically speaking, PCA performs a linear transformation, moving the original set of features to a new space composed of principal components. These new features have no real meaning for us, only an algebraic one, so do not expect that by combining features linearly you will discover new features you never thought could exist.
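That PCA builds linear combinations rather than selecting columns can be seen directly from `components_` on synthetic data: every component assigns a (generally nonzero) weight to every original feature.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))

pca = PCA(n_components=2).fit(X)
# Each row of components_ weights ALL original features:
# a new axis built from a linear combination, not a chosen subset of columns
weights = pca.components_
nonzero_per_component = (np.abs(weights) > 1e-12).sum(axis=1)
```

With real correlated data the weights reveal which original features dominate each component, but no feature is ever simply "dropped".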

Advantages------

1- Reduces the storage space needed
2- Speeds up computation
3- Removes redundant features
4- Reduces the data dimension to 2D or 3D, which allows us to plot and visualize it
5- Mitigates overfitting, since too many features or too complex a model can lead to it

Conclusion----

Dimensionality reduction plays a really important role in machine learning, especially when you are working with thousands of features. Principal Component Analysis is one of the top dimensionality reduction algorithms; it is not hard to understand and to use in real projects. Besides making feature manipulation easier, this technique also helps to improve the results of the classifier.

In [ ]: